Finding Topics in Emails: Is LDA enough?

نویسندگان

  • Shafiq Joty
  • Giuseppe Carenini
  • Gabriel Murray
  • Raymond Ng
چکیده

Our research addresses the task of finding topics at the sentence level in email conversations. As an asynchronous collaborative application, email has its own characteristics which differ from written monologues (e.g., text books, news articles) or spoken dialogs (e.g., meetings). Hence, the generative topic models like Latent Dirichlet Allocation (LDA) and its variations, which are successful in finding topics in monologue or dialog, may not be successful by themselves in asynchronous written conversations like emails. However, an effective combination of LDA with other important features can give us the desired results. We first point out the specific characteristics of emails that we need to consider in order to find the inherent topics discussed in an email conversation. Then we demonstrate why the generative topic models by themselves may not be adequate for this task. We propose a novel graph-theoretic framework to solve the problem. Crucial to our proposed approach is that it captures the discriminative email features and integrates the strengths of the supervised approach with the unsupervised technique considering LDA yet as one of the important factors.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

یک مدل موضوعی احتمالاتی مبتنی بر روابط محلّی واژگان در پنجره‌های هم‌پوشان

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

متن کامل

Topic words analysis based on LDA model

Social network analysis (SNA), which is a research field describing and modeling the social connection of a certain group of people, is popular among network services. Our topic words analysis project is a SNA method to visualize the topic words among emails from Obama.com to accounts registered in Columbus, Ohio. Based on Latent Dirichlet Allocation (LDA) model, a popular topic model of SNA, o...

متن کامل

Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

This work concerns automatic topic segmentation of email conversations. We present a corpus of email threads manually annotated with topics, and evaluate annotator reliability. To our knowledge, this is the first such email corpus. We show how the existing topic segmentation models (i.e., Lexical Chain Segmenter (LCSeg) and Latent Dirichlet Allocation (LDA)) which are solely based on lexical in...

متن کامل

Variable Selection for Latent Dirichlet Allocation

In latent Dirichlet allocation (LDA), topics are multinomial distributions over the entire vocabulary. However, the vocabulary usually contains many words that are not relevant in forming the topics. We adopt a variable selection method widely used in statistical modeling as a dimension reduction tool and combine it with LDA. In this variable selection model for LDA (vsLDA), topics are multinom...

متن کامل

Exploiting Conversation Features for Finding Topics in Emails

Our ongoing research addresses the task of finding topics at the sentence level in email conversations. We first describe how the existing topic models can be applied to this problem. Then we demonstrate why the existing methods are inadequate for this task and what more we need to consider. With an experiment we further show that conversation structure in the form of fragment quotation graph c...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009